Introduction to the use of weights for sampling/selection bias

Lorenzo Fabbri

ISGlobal

2024-01-11

Selection bias

  • Of the subjects belonging to the HELIX sub-cohort (0th follow-up), only a fraction took part in the follow-up (1st follow-up).
  • That is, among the eligible subjects, some will be excluded from our analyses because, e.g., they have no outcome measurement.
  • Censoring from the analysis those with missing values may introduce selection bias.

A simple example

  • Suppose exposure A is a measure of SEP and outcome Y is a diagnosis of ADHD. The censoring variable C is a collider on this pathway, and L is a confounder. To estimate the effect of A on Y, we should NOT adjust for C (Figure 1).
Figure 1: Effect estimation with selection bias

A simple example

  • The problem is that in the 1st follow-up we are implicitly adjusting for C as well, thus opening a non-causal path between A and Y.

  • Thus, censoring due to loss to follow-up can introduce selection bias. We are then generally interested in the effect that would have been observed had nobody been censored. If A is binary, we are interested in:

    \[E\left[Y ^ {a=1,c=0}\right] - E\left[Y ^ {a=0,c=0}\right],\]

    which is the joint effect of A and C.
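This collider mechanism can be illustrated with a small simulation (hypothetical data, not HELIX): here A has no effect on Y, yet restricting the analysis to the uncensored induces a spurious association.

```r
# Hypothetical simulation: A -> C <- U -> Y, with no effect of A on Y.
set.seed(1)
n <- 1e5
U <- rnorm(n)              # unmeasured common cause of censoring and outcome
A <- rbinom(n, 1, 0.5)     # exposure, truly unrelated to Y
C <- rbinom(n, 1, plogis(-1 + 1.5 * A + 1.5 * U))  # censoring depends on A and U
Y <- U + rnorm(n)          # outcome depends on U only

coef(lm(Y ~ A))["A"]                   # full data: estimate close to 0
coef(lm(Y ~ A, subset = C == 0))["A"]  # uncensored only: clearly biased
```

Conditioning on C = 0 makes A and U associated among the selected, and since U causes Y, a spurious A–Y association appears.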

Reducing the effects of selection bias

  • IP weights can be used to estimate our causal effect (i.e., to estimate the parameters of the MSM \(E\left[Y^{a,c=0}\right] = \beta_0 + \beta_1 a\)):

    \[W ^ {A,C} = W^A \times W^C,\]

    with \(W^C = 1 / Pr\left[C=0|L,A\right]\) (\(W^C=0\) for the censored), and \(W^A = 1 / f(A|L)\) being the usual IP weights to adjust for confounding.

  • Alternatively, we can compute the stabilized IP weights:

    \[SW^C = Pr\left[C=0|A\right] / Pr\left[C=0|L,A\right].\]

  • Alternatively, we can use standardization (outcome modeling) to adjust for confounding and selection bias.
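As a sketch of how the censoring weights above can be computed (simulated data and variable names, not the package's actual implementation), two logistic regressions give the denominator and, for the stabilized weights, the numerator:

```r
# Sketch: unstabilized and stabilized censoring weights with base R.
# dat is a hypothetical data frame with exposure A, confounder L, and
# censoring indicator C (1 = lost to follow-up).
set.seed(1)
n <- 1000
dat <- data.frame(L = rnorm(n), A = rbinom(n, 1, 0.5))
dat$C <- rbinom(n, 1, plogis(-1 + 0.5 * dat$A + 0.8 * dat$L))

# Denominator: Pr[C = 0 | L, A]
denom <- 1 - predict(glm(C ~ A + L, family = binomial, data = dat),
                     type = "response")
# Numerator for the stabilized weights: Pr[C = 0 | A]
numer <- 1 - predict(glm(C ~ A, family = binomial, data = dat),
                     type = "response")

dat$W_C  <- ifelse(dat$C == 0, 1 / denom, 0)      # W^C (0 for the censored)
dat$SW_C <- ifelse(dat$C == 0, numer / denom, 0)  # SW^C

# Stabilized weights should average about 1 among the uncensored
mean(dat$SW_C[dat$C == 0])
```

Checking that the stabilized weights have mean close to 1 (among the uncensored) is a quick diagnostic for misspecification of the weight models.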

An actual example

  • Reweighting the UK Biobank to reflect its underlying sampling population substantially reduces pervasive selection bias due to volunteering (Alten et al. 2022).
  • The UKB is not representative of its underlying sampling population. That is, among all eligible subjects, most did not take part (volunteer bias).
  • The authors modeled the selection process and corrected for this bias using inverse probability weights estimated with data from the UK census.

The SelectionWeights R package

  • Since weight estimation depends on question-specific confounders and, possibly, exposures, it makes sense for the researcher to estimate the weights directly.
  • The aim (work in progress) is to create an R package (SelectionWeights) for that:
# Estimate selection weights with the (work-in-progress) package; the
# comments on the arguments reflect their intended meaning.
sel_weights <- SelectionWeights::estimate_selection_weights(
  dat = dat,                            # full data set (eligible subjects)
  id_str = "HelixID",                   # name of the subject-ID column
  ids_not_censored = ids_not_censored,  # IDs observed at the 1st follow-up
  formula = "sex + age + SEP",          # predictors of selection
  method_estimation = "glm",
  link_function = "gaussian",           # family/link for the selection model
  stabilized = TRUE,                    # return stabilized weights
  winsorization = 0.9,                  # truncate extreme weights
  estimate_by = "cohort",               # estimate weights separately by cohort
  sampling_weights = NULL,
  moments = NULL,
  interactions = NULL,
  library_sl = NULL,                    # SuperLearner settings (unused with "glm")
  cv_control_sl = NULL,
  discrete_sl = NULL
)

# Fit the weighted outcome model with the estimated weights
mod <- glm(
  outcome ~ exposure + sex + age + SEP,
  data = dat,
  weights = sel_weights
)
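One caveat: with estimated weights, the model-based standard errors from glm are not valid for the MSM parameters; a robust (sandwich) variance or the bootstrap is typically used instead. Below is a base-R sketch of the sandwich variance on simulated data (in practice one would likely use the sandwich package):

```r
# Hypothetical data: weighted outcome model plus hand-rolled robust SEs.
set.seed(1)
n <- 500
dat <- data.frame(exposure = rbinom(n, 1, 0.5), sex = rbinom(n, 1, 0.5))
dat$outcome <- 1 + 0.5 * dat$exposure + rnorm(n)
w <- runif(n, 0.5, 2)  # stand-in for estimated selection weights

mod <- lm(outcome ~ exposure + sex, data = dat, weights = w)

# Sandwich (HC0) variance for the weighted least-squares fit:
X <- model.matrix(mod)
res <- residuals(mod)
bread <- solve(crossprod(X * sqrt(w)))   # (X' W X)^{-1}
meat  <- crossprod(X * (w * res))        # sum_i (w_i e_i)^2 x_i x_i'
robust_vcov <- bread %*% meat %*% bread
sqrt(diag(robust_vcov))                  # robust SEs
```

The robust SEs replace those from summary(mod), which ignore the fact that the weights were estimated and treat them as precision weights.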

Exploring balance

  • Whether the estimated weights actually achieve covariate balance can be assessed with tools like the Love plot.

Love plot
  • Another element to take into account is the effective sample size after weighting.
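A common summary is Kish's effective sample size, \(ESS = \left(\sum_i w_i\right)^2 / \sum_i w_i^2\), which says how many unweighted observations the weighted sample is "worth":

```r
# Kish's effective sample size: equal weights give ESS = n;
# highly variable weights shrink it.
ess <- function(w) sum(w)^2 / sum(w^2)

ess(rep(1, 100))        # equal weights: ESS = 100
ess(c(rep(1, 99), 50))  # one extreme weight: ESS far below 100
```

A large drop in ESS after weighting signals extreme weights and unstable estimates, which is one motivation for stabilization and winsorization.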

What’s next?

  • Finish the R package, including an option to export the results in a nice format (e.g., docx tables for your papers).
  • Add a vignette to explain what it does and how to use it.
  • Add some unit tests to assess code correctness.

References

Alten, Sjoerd van, Benjamin W Domingue, Titus Galama, and Andries T Marees. 2022. “Reweighting the UK Biobank to Reflect Its Underlying Sampling Population Substantially Reduces Pervasive Selection Bias Due to Volunteering.” medRxiv, 2022–05.
Hernán, Miguel A, and James M Robins. 2010. “Causal Inference.” Boca Raton, FL: CRC Press.